[AURON #2236] Support compile with Celeborn 0.6 and spark 4.0 #2243
cxzl25 merged 3 commits into apache:master
Pull request overview
Adds Spark 4.0 compilation support when building Auron with Celeborn 0.6 by updating Spark-version-gated APIs and aligning Spark 4’s shuffle-write execution behavior with existing Spark 3.x assumptions.
Changes:
- Extend @sparkver gating for getPartitionLengths() to include Spark 4.0 in Celeborn 0.6 and RSS shuffle writer implementations.
- Update NativeRDD.compute() to also defer execution for RSS_SHUFFLE_WRITER plans on Spark 4+ (to avoid early execution under the refined ShuffleWriteProcessor flow).
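The deferral rule in the second bullet can be sketched as a small pure function. This is only an illustration, not the code in NativeRDD.scala: the object name, the `PlanKind` hierarchy, and `shouldDefer` are all hypothetical, and the sketch assumes that a plain shuffle-writer plan was the case already deferred on Spark 4+ and that nothing is deferred on Spark 3.x.

```scala
object DeferralSketch {
  // Hypothetical plan kinds; the real code inspects the native plan instead.
  sealed trait PlanKind
  case object ShuffleWriter extends PlanKind    // assumed to be the previously deferred case
  case object RssShuffleWriter extends PlanKind // the case this PR adds
  case object Other extends PlanKind

  // Extract the major version from a string such as "4.0.0".
  def majorVersion(sparkVersion: String): Int =
    sparkVersion.takeWhile(_ != '.').toInt

  // Defer only on Spark 4+ and only for shuffle-write plans (assumption, see lead-in).
  def shouldDefer(sparkVersion: String, kind: PlanKind): Boolean =
    majorVersion(sparkVersion) >= 4 && (kind match {
      case ShuffleWriter | RssShuffleWriter => true
      case Other                            => false
    })
}
```

Keeping the decision in one predicate makes it easy to see that the change only widens the set of plan kinds; the version condition is untouched.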
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| thirdparty/auron-celeborn-0.6/src/main/scala/org/apache/spark/sql/execution/auron/shuffle/celeborn/AuronCelebornShuffleWriter.scala | Adds Spark 4.0 to the version-gated getPartitionLengths() override for Celeborn 0.6 integration. |
| spark-extension/src/main/scala/org/apache/spark/sql/execution/auron/shuffle/AuronRssShuffleWriterBase.scala | Adds Spark 4.0 to the version-gated getPartitionLengths() override for RSS shuffle writer base. |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeRDD.scala | Extends the Spark 4+ shuffle-write deferral logic to cover RSS shuffle writer plans as well. |
@ftong2020 Thanks for the contribution! Please update the PR title to include the issue ID:
sure
The newly added parameters of
    def apply(
        loc: BlockManagerId,
        uncompressedSizes: Array[Long],
        mapTaskId: Long,
        checksumVal: Long = 0): MapStatus = {
I tried it; celeborn-client itself does not support Spark 4.1. Spark throws the following exception whether or not Auron is enabled:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (4a802e9ef77a executor driver): java.lang.NoSuchMethodError: 'org.apache.spark.scheduler.MapStatus org.apache.spark.scheduler.MapStatus$.apply(org.apache.spark.storage.BlockManagerId, long[], long)'
Driver stacktrace:
Sorry, I forgot: Celeborn needs to release version 0.7.0 to support Spark 4.1.
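The NoSuchMethodError reported in this thread is consistent with how Scala compiles default arguments: adding a defaulted parameter keeps three-argument calls source-compatible, but the compiled method has only the four-parameter JVM signature, so a client jar built against the older three-parameter `apply` cannot link. A minimal sketch, using a hypothetical `Status` object as a stand-in for `MapStatus` (the names and the `String` return type are illustrative only):

```scala
object BinaryCompatSketch {
  // Hypothetical stand-in for a companion object whose apply gained a defaulted parameter.
  object Status {
    def apply(
        loc: String,
        uncompressedSizes: Array[Long],
        mapTaskId: Long,
        checksumVal: Long = 0): String =
      s"$loc:${uncompressedSizes.length}:$mapTaskId:$checksumVal"
  }

  // A three-argument call still compiles from source; the compiler fills in the default.
  def sourceLevelCall: String = Status("host", Array(1L, 2L), 7L)

  // Parameter counts of every public method named "apply" on the compiled object.
  // The default value lives in a separate apply$default$4 method, so no
  // three-parameter apply exists in bytecode.
  def applyArities: Seq[Int] =
    Status.getClass.getMethods
      .filter(_.getName == "apply")
      .map(_.getParameterCount)
      .toSeq
      .sorted
}
```

Under this reading, recompiling the client against the new Spark version is what resolves the error, which matches the note that a new Celeborn release is needed.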
Which issue does this PR close?
Closes #2236
Rationale for this change
Spark 4.0 has been released for over a year, and Celeborn 0.6 provides official Spark 4.0 support.
This change adds Spark 4.0 support when Auron is compiled with Spark 4.0 and Celeborn 0.6.
Spark 4.1 support is not included in this PR because the latest Celeborn version does not support that configuration. Spark 4.1 also changed the signature of MapStatus.apply(), which would demand heavy changes to the codebase.
What changes are included in this PR?
Are there any user-facing changes?
no
How was this patch tested?
Tested in our staging env